Not getting in too deep: A practical deep learning approach to routine crystallisation image classification

Using a relatively small training set of ~16 thousand images from macromolecular crystallisation experiments, we compare classification results obtained with four of the most widely used convolutional deep-learning network architectures that can be implemented without the need for extensive computational resources. We show that the classifiers have different strengths that can be combined to provide an ensemble classifier achieving a classification accuracy comparable to that obtained by a large consortium initiative. We use eight classes to rank the experimental outcomes effectively, providing detailed information that can be used with routine crystallography experiments to identify crystal formation automatically for drug discovery, and paving the way for further exploration of the relationship between crystal formation and crystallisation conditions.


Introduction
The determination of protein structures using X-ray crystallography provides insight into the interaction between a protein and target drug compounds in the drug discovery pipeline. Despite technical advances in the crystallisation process, obtaining suitable crystals for X-ray diffraction is still a major bottleneck in drug design [1]. The current laboratory procedure still relies heavily on trial and error, although crystallisation screens have been developed to sample the search space of different conditions intelligently. Many screens from both commercial and academic sources have been created [2][3][4][5][6], providing combinations of chemicals designed to encourage nucleation and maintain a particular pH. Such screens allow tens of thousands of unique conditions to be tested [7], with the protein construct, concentration, suitable ligand selection and experimental temperature chosen by the crystallographer.
The use of robotics now allows nanolitre-volume crystallisation experiments, and automated storage and imaging systems are routinely used to record the results. In high-throughput centres, thousands of experiments can be set up each day and each experiment may be imaged multiple times over several weeks. Successful crystallisation rates are thought to be less than 5% [8] and optimisation experiments are usually required to obtain crystals of X-ray diffraction quality. Here, we show that combining the results of four deep learning classifiers improves performance compared to any of the individual classifiers, achieving an accuracy comparable to the 94% reported in the MARCO project. We provide the hyper-parameters used with all four networks, without further optimisation, to allow others to obtain similar results with their own images.

Data and labelling
The number of classes used to categorise crystallisation images has varied between studies, ranging from just two, crystal and non-crystal, to as many as ten [20]. If the goal is simply to identify drops containing crystals, then the former might be appropriate but, if the results are to be used in further analyses, for example to optimise crystallisation conditions and to better understand the crystal formation process, then more classes are required. We have chosen to use the eight classes shown in Table 1 to provide information on the level of success of an experiment. Our categories include three separate crystal classes: shootable, representing crystals of a quality that could be sent for X-ray diffraction; optimisable, which includes very small crystals, needles and plates; and crystalline, representing microcrystalline precipitate (often interpreted as "promising looking" by crystallographers).

Regardless of the number of classes, human classifiers will always disagree to some extent because the classes are not truly discrete. After a training session led by an experienced crystallographer, a subset of 3,095 images was labelled independently by three researchers, allowing the level of agreement to be assessed. Of these images, 2,745 were unique, with an additional set of 50 images each duplicated seven times and interspersed throughout. Images were labelled in the same order by each researcher and intra-observer consistency was measured by the number of times that the 50 repeated images were given the same label. At least 86% of the repeated images were given the same label in four of the seven viewings, but the proportions dropped sharply for higher levels of agreement, with an individual giving the same label in all seven viewings for only 10% to 28% of the repeated images. Overall agreement rates between different human classifiers ranged from 50% to 70%, illustrating the difficulty of assigning discrete classes to outcomes that lie on a continuum.

Given the low agreement rates between classifiers, a set of 5,972 images was labelled by consensus between three scientists to produce an initial training set. The model produced with this training data was then used to classify further images, and a simple graphical user interface was developed to check the results (S1 Fig in S1 File). Any image for which the classification was considered correct was added to the training set with the assigned label, whereas any image considered misclassified was relabelled before being added. This allowed the size of the training set to increase rapidly but reliably in an iterative procedure that continuously improved the classifier.
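As an aside on how such agreement figures can be computed, the sketch below (with hypothetical label arrays) shows the raw pairwise agreement rate and the chance-corrected Cohen's Kappa between two labellers using scikit-learn; the labels shown are placeholders, not data from this study.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two human classifiers for the same set of images.
labels_a = np.array(["clear", "shootable", "light precipitate", "clear"])
labels_b = np.array(["clear", "optimisable", "light precipitate", "null"])

agreement = np.mean(labels_a == labels_b)      # raw agreement (50-70% between humans here)
kappa = cohen_kappa_score(labels_a, labels_b)  # agreement corrected for chance
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```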
As each experiment is imaged multiple times, often with little change between consecutive images, only images from days 1, 8 and 13 were ever included in the training set to reduce redundancy.
Images are mainly from two particular projects, with 7,395 images associated with the receptor tyrosine kinases EGFR [21] and c-KIT [22]. A further 8,922 images are from initial crystallisation screens involving several different proteins to provide the model with a greater variety of experimental outcomes and different crystal forms. In total, 16,317 images were used for training the classifier and Table 1 shows the number in each class.
Classification results were validated with three independent test sets. To assess the effect of including very similar images in the training and test sets, our first test set consists of images from time courses represented in the training data but from different batches (i.e. from days other than 1, 8 and 13). The second test set consists of images of proteins represented in the training set but from different experimental plates, and the third test set comprises images from independent experiments associated with the chemotherapy drug Trastuzumab [23], for which no images were included in the training set. S2 Fig in S1 File shows the distribution of images in each of the three test sets and Table 2 gives the number of images in each.
As colour images multiply the complexity of models and increase computational cost, our RGB images are converted to grayscale, or more specifically luminance (L), where L = 0.299R + 0.587G + 0.114B. Each image is cropped to 800x800 pixels, covering the protein crystallisation drop. Class imbalance is a fundamental issue in image classification, as unbalanced training data results in models that are biased towards the better represented classes. With an estimated crystallisation rate of <5%, it is especially problematic for automated crystallisation image analysis. Although we limited the number of images in other classes, the three crystal classes (crystalline, optimisable and shootable) account for just 20.9% of the training data, with only 5.7% labelled shootable. As recognising crystals is a key aim, we incorporated data augmentation into the pre-processing stage to balance the number of images in each training class. Multiple independent transformations are applied to images from under-represented classes, these being random rotations, horizontal and vertical flips, and contrast, brightness and resolution adjustments. Each class is duplicated by the rounded inverse of its proportion relative to the largest class (light precipitate). The training set is summarised in Table 1.
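As an illustration, the grayscale conversion, cropping and class-balancing augmentation described above could be implemented along the lines of the sketch below; the crop strategy, augmentation ranges and class counts are assumptions for illustration rather than the exact pipeline used.

```python
import tensorflow as tf

def to_grayscale(rgb):
    """Convert an RGB image to luminance: L = 0.299R + 0.587G + 0.114B."""
    weights = tf.constant([0.299, 0.587, 0.114], dtype=rgb.dtype)
    return tf.reduce_sum(rgb * weights, axis=-1, keepdims=True)

def preprocess(image):
    """Crop to an 800x800 region covering the drop and convert to grayscale."""
    image = tf.image.resize_with_crop_or_pad(image, 800, 800)  # assumes a centred drop
    return to_grayscale(tf.cast(image, tf.float32))

def augment(image):
    """Random transformations used to up-sample under-represented classes."""
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return image

# Each under-represented class is repeated by the rounded inverse of its proportion
# relative to the largest class (light precipitate); the counts below are placeholders.
class_counts = {"light precipitate": 4000, "shootable": 930}
repeat_factor = round(class_counts["light precipitate"] / class_counts["shootable"])
```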

Model comparison
The MARCO classifier used an Inception-V3 architecture [24] with an additional convolutional layer that reduces the image dimensions from 599x599 to 299x299 pixels. Inception-V3 is a variation on the original Inception module in which a set of convolutional operations occurs in parallel before concatenation on output. The V3 module incorporates larger convolutions and regularisation in the form of batch normalisation and label smoothing. This introduces greater complexity but reduces the top-5 error on the ImageNet dataset from 6.67% to 4.2% in comparison to Inception-V1 (GoogLeNet [25]). Advances in image classification are constantly being made, leading to reduced training times and computational cost as well as improved classification rates. The building blocks, or layers, used to create the architecture of a deep learning network vary and are often combined in new architectures to extend and improve previous ideas. The application programming interface (API) Keras [19] compared network architectures using the ImageNet validation set [26] and reported the results in terms of Top-1 accuracy, where the class given the highest probability is correct, and Top-5 accuracy, where one of the five highest-probability answers is the correct class. The number of parameters involved and training times are also reported. Computational efficiency in relation to accuracy has been investigated previously for the most popular CNNs [27] and, although NASNet-A-Large was found to have the greatest Top-1 and Top-5 accuracies, it also had the greatest computational cost. Here, we chose to investigate the four architectures with the best reported Top-1 and Top-5 accuracy that did not exceed our GPU capacity: ResNet50 [28], DenseNet121 [29], InceptionV3 [24] and Xception [30]. The architectures of these four networks are given in S3-S8 Figs in S1 File. ResNet is a popular architecture whose shortcut (skip) connections stabilise training, but these can compromise the learning capacity of the network. DenseNet mitigates this by concatenating all previous feature maps rather than summing them as in ResNet, resulting in a 2.4% improvement in Top-1 accuracy on the ImageNet validation data with 5.4 million fewer parameters. Similarly, Xception ('extreme Inception') was designed to improve on InceptionV3 and achieved a 1.1% increase in accuracy on the ImageNet images while using a million fewer parameters.
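All four candidate architectures are available through the Keras Applications API; the short sketch below simply instantiates each backbone to compare parameter counts (the input shape and include_top setting are illustrative choices, not the configuration used in this study).

```python
import tensorflow as tf

architectures = {
    "ResNet50": tf.keras.applications.ResNet50,
    "DenseNet121": tf.keras.applications.DenseNet121,
    "InceptionV3": tf.keras.applications.InceptionV3,
    "Xception": tf.keras.applications.Xception,
}

for name, build in architectures.items():
    # include_top=False drops the 1000-class ImageNet head so a bespoke head
    # for the eight crystallisation classes can be attached later.
    model = build(weights=None, include_top=False, input_shape=(299, 299, 3))
    print(f"{name}: {model.count_params():,} parameters")
```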
Network parameters were optimised using the DenseNet121 architecture with a subset of 8,000 images from the MARCO project, split into training (80%) and validation (20%) sets. During optimisation, training was carried out for 20 epochs using a cross-entropy loss function and improvements in validation accuracy were identified as individual parameters were changed. Tests indicated that two additional fully connected layers, separated by dropout layers, reduced overfitting and increased validation accuracy. The network parameters found to be optimal (S1 Table in S1 File) were used to train new classifiers with the four chosen architectures and the training data shown in Table 1. A consistent batch size of 16 was used for all architectures regardless of complexity, and training ran for 100 epochs in each case with no early stopping. Weights were optimised with the Adam stochastic gradient descent optimiser and an initial learning rate of 2e-4, with the learning rate halved if the loss had not decreased in 5 epochs. Multiple training runs performed with the DenseNet121 classifier with different subsets of the MARCO dataset as validation data achieved accuracies between 91% and 94%.
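A minimal sketch of this training configuration is given below, with a generic classification head of two fully connected layers separated by dropout on top of a DenseNet121 backbone; the layer sizes and dropout rates are placeholders, since the optimised values are those listed in S1 Table in S1 File.

```python
import tensorflow as tf

NUM_CLASSES = 8

def add_classification_head(backbone):
    """Append two fully connected layers separated by dropout to a backbone."""
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    x = tf.keras.layers.Dense(512, activation="relu")(x)  # layer size: placeholder
    x = tf.keras.layers.Dropout(0.5)(x)                   # dropout rate: placeholder
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(backbone.input, outputs)

backbone = tf.keras.applications.DenseNet121(
    weights=None, include_top=False, input_shape=(800, 800, 1))
model = add_classification_head(backbone)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),  # initial learning rate 2e-4
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Halve the learning rate whenever the training loss has not decreased for 5 epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5, patience=5)

# model.fit(train_dataset, validation_data=val_dataset, epochs=100, callbacks=[reduce_lr])
# train_dataset is assumed to be batched with batch_size=16; no early stopping is used.
```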
Transfer learning allows the weights from pre-trained models to be re-used as a starting point for new deep learning problems to save time. However, the models provided for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) are based on RGB images, so, to adapt them to our grayscale images, we created two Keras Application models (keras.io/api/applications) for each framework: one accepting RGB images using the ImageNet weights and the other accepting grayscale images with no initial weights specified. The ImageNet weights were transferred to the grayscale model from the layer at which the input shape allowed this (a sketch of this weight transfer is given at the end of this section). This approach has been shown to improve the efficiency of the model by reducing the number of initial channels without significantly affecting precision [31]. All training was implemented using TensorFlow and Keras and run on four NVIDIA Tesla V100 graphics processing units. The training times per epoch are shown in Table 3.

Fig 2 shows the performance of the four classifiers in ROC space when the three classes crystalline, optimisable and shootable are considered together as positive results and all other classes as negatives. As expected, the results for all architectures are best for Test set 1, for which images from the same time series were included in the training data. The ResNet50 classifier stands out as most different, with noticeably higher sensitivity for Test sets 1 and 2 at the expense of much lower specificity. With this classifier there are fewer false positives for Test set 3, comprising images from experiments on a protein not represented in the training data, but the sensitivity is also much lower. The DenseNet121 classifier has similar sensitivity for Test sets 2 and 3 but lower specificity for Test set 2, where images from experiments with the same protein, but not the same time series, were included in the training set. Xception gives very similar results to DenseNet121 for Test set 3, whereas Inception-V3 produces fewer true positives. However, for Test set 2, Inception-V3 shows a slight improvement over DenseNet121, whereas for Xception a reduction in false positives is outweighed by a reduction in true positives. It should be noted that the number of images in each class differs between test sets and that Test set 3 in particular has few positive results (see Table 2).

Accuracy is measured as the proportion of classifications that are correct irrespective of class and can therefore be misleading when classes are unbalanced. Cohen's Kappa, originally devised to compare inter-rater agreement, allows for class imbalance when used to assess classification results and so can provide a better representation of performance. Fig 3 shows the Kappa and accuracy scores for the three test sets along with two other common performance metrics, precision and the F1-score. Although the latter two scores are typically used for binary classification, one-vs-all scores can be computed for each class and the weighted average used to provide an overall score. Regardless of the performance metric, DenseNet121 outperformed the other classifiers on all three test sets. Xception produced the next best results, followed by InceptionV3 and ResNet50, with little to choose between them for Test sets 2 and 3. Results are also shown in S2 Table in S1 File.
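As referenced above, the grayscale weight transfer could be carried out roughly as sketched below: an RGB model is built with ImageNet weights, a single-channel model is built with no weights, and weights are copied layer by layer wherever the shapes match, so the first convolution, which expects three input channels, keeps its random initialisation. The details are an illustrative assumption rather than the exact code used.

```python
import tensorflow as tf

def grayscale_model_with_imagenet_weights(build, input_shape=(800, 800, 1)):
    """Build a single-channel backbone and copy ImageNet weights where shapes match."""
    rgb_model = build(weights="imagenet", include_top=False)
    gray_model = build(weights=None, include_top=False, input_shape=input_shape)

    for rgb_layer, gray_layer in zip(rgb_model.layers, gray_model.layers):
        rgb_weights = rgb_layer.get_weights()
        gray_weights = gray_layer.get_weights()
        # The first convolution expects 3 input channels, so its shapes differ and it
        # is skipped; all subsequent layers match and receive the ImageNet weights.
        if all(rw.shape == gw.shape for rw, gw in zip(rgb_weights, gray_weights)):
            gray_layer.set_weights(rgb_weights)
    return gray_model

gray_backbone = grayscale_model_with_imagenet_weights(tf.keras.applications.DenseNet121)
```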

Results
It is somewhat surprising that the results obtained for Test set 3, using images from a project not represented in the training data, are consistently better than those for Test set 2, for which images from experiments with the same proteins are included. Apart from the null class, which has few examples in Test set 3, results for Test set 3 are closer to those for Test set 1. Fig 5 shows the results for Test sets 2 and 3 together, in confusion matrices where the number of classes has been reduced by combining the two precipitate classes and the three crystal classes.
The lowest class accuracy overall is for phase separation with Xception, which is confused with both precipitate and crystals. The Xception classifier also confuses both null and clear images with precipitate, as does the InceptionV3 classifier. Other regular errors with InceptionV3 are crystals classified as precipitate and phase separation classified as crystals. With ResNet50, the lowest accuracies are for phase separation, most often confused with the crystal classes, and for crystals confused with precipitate. As seen earlier, the best results overall are obtained with the DenseNet121 classifier, with the lowest accuracy for null images, which are mainly confused with clear, although ~10% of precipitate images are classified as crystals.
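For completeness, the merged-class confusion matrices of Fig 5 could be produced as in the sketch below, assuming the eight-class labels are available as strings; the name of the second precipitate class ('heavy precipitate') is an assumption here.

```python
from sklearn.metrics import confusion_matrix

# Merge the two precipitate classes and the three crystal classes, as in Fig 5.
# 'heavy precipitate' as the name of the second precipitate class is an assumption.
merge = {
    "light precipitate": "precipitate",
    "heavy precipitate": "precipitate",
    "crystalline": "crystal",
    "optimisable": "crystal",
    "shootable": "crystal",
}

def merged(labels):
    """Map eight-class labels onto the reduced five-class scheme."""
    return [merge.get(label, label) for label in labels]

reduced_classes = ["null", "clear", "phase separation", "precipitate", "crystal"]

# y_true and y_pred would be the eight-class labels for a test set:
# cm = confusion_matrix(merged(y_true), merged(y_pred), labels=reduced_classes)
```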
The differences in misclassifications between classifiers suggest the possibility of improved performance from ensemble classification. Indeed, majority voting with all four classifiers yields balanced accuracies 13% and 19% higher than the DenseNet121 classifier for phase separation and crystalline examples respectively, with a reduction of just 1-2% for null and optimisable (Table 4). Table 5 shows improvements of up to 21% in overall accuracy and 26% in Cohen's Kappa when the predictions of all four classifiers are combined (Ensemble4), including improvements of 12% in accuracy and 13% in Kappa over the DenseNet121 classifier. We found that choosing the class randomly in the case of tied votes actually gave slightly worse results than simply taking the class that comes up first, and realised that this was due to the order in which the predictions were provided. The predictions from the DenseNet121 classifier, shown to be the best individual classifier for our data, are given first and are therefore given priority in the case of ties. Further investigation showed that changing the order of the other three classifiers made no difference to the results. However, as the Xception architecture required so much extra training time, we also tried combining the results without this classifier (Ensemble3); Tables 4 and 5 show that correct classifications fall by 3-4%. Comparison of the accuracies in Table 5 with those achieved during training (97-99%) shows that, although there is some overfitting of the individual models, this is greatly reduced for the ensemble models and is therefore not a major problem.
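A minimal sketch of the majority-voting scheme, assuming each classifier's per-image predicted labels are available as lists; placing the DenseNet121 predictions first reproduces the tie-breaking behaviour described above, because ties fall to the label encountered first.

```python
from collections import Counter

def majority_vote(*prediction_lists):
    """Combine per-image predicted labels from several classifiers by majority vote.

    Ties are broken in favour of the classifier listed first (here DenseNet121),
    because Counter.most_common keeps insertion order for labels with equal counts.
    """
    combined = []
    for votes in zip(*prediction_lists):
        counts = Counter(votes)
        combined.append(counts.most_common(1)[0][0])
    return combined

# Ensemble4: all four classifiers, with DenseNet121 first so it wins tied votes.
# ensemble4 = majority_vote(densenet_preds, inceptionv3_preds, resnet50_preds, xception_preds)
# Ensemble3: the same scheme without the much slower Xception classifier.
# ensemble3 = majority_vote(densenet_preds, inceptionv3_preds, resnet50_preds)
```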

Discussion
The MARCO classifier was trained with over 440 thousand images from five different academic and industrial sources, achieving an accuracy of ~94% on test images from these institutions [17]. However, even when the same imaging system is used, subtle differences in the set-up result in much lower accuracies for images from other organisations. Although the classifier is freely available [18], the weights are not, which means that tuning with additional images is not possible and the only way to incorporate new information is to retrain from scratch. Rosa and colleagues characterised the image data used in the original MARCO model and investigated the training settings most likely to enhance the local performance of a classification model based on these images [32]. However, we found that using only our own images gave better results than including even subsets of the MARCO images. Our training set is very small in comparison but diverse and carefully labelled, and we have shown that similar results are achieved for independent test data whether from projects represented in the training data or not.
We compared four of the most widely used network architectures, chosen on the basis of the performance reported by the Keras API while taking our computational limitations into account. It is possible that with greater computational capacity more advanced models, such as Inception-v4 and SENet-154, could further improve classification rates as they do with the ImageNet validation data. Each of our chosen architectures has been used in recent detection studies in other fields. A previous benchmark analysis of different architectures found that Xception gave the highest accuracy, followed by InceptionV3, ResNet50 and then DenseNet121 [27], whereas, for our data, the DenseNet121 architecture outperformed the others. However, we obtained very similar accuracy with Xception, which in fact gave a higher Kappa score for Test sets 2 and 3 combined. Although the ResNet50 architecture produced the worst predictions overall, this classifier had a high crystal detection rate of 93.8%, albeit with more false positives, as can be seen in Fig 2. The training time for the Xception classifier was four times that for InceptionV3, requiring 127 hours over the 100 epochs in comparison to between 31 and 39 hours for the other three classifiers. As the four classifiers had different strengths in terms of particular class accuracies, we were able to combine their results in an ensemble classifier using majority voting and achieve accuracy comparable to that seen in the MARCO project. We consider the increase in accuracy over the best individual classifier to be worth the combined training time.
We have chosen to categorise the results of crystallisation experiments using eight classes rather than the four used in the MARCO project in order to provide useful information for further analysis. Whereas the MARCO classes combined both good and bad results in the ambiguously named other class, we aim to show just how good an outcome is by having more classes. With the exception of null, which could represent an unavoidable technical problem rather than an experimental outcome, our eight classes reflect a ranking, even within the three crystal classes, that could inform further work. We have developed a graphical user interface (GUI) to visualise the results for each experimental plate and Fig 6 shows an example obtained using the Morpheus crystallisation screen [4]. This screen was designed as a grid, with columns covering four different precipitant-cryoprotectant mixes for three different buffer systems (pH 6.5, 7.5 and 8.5) and rows corresponding to different additives. Patterns can be seen in Fig 6, with certain columns being associated with particular classes, while some additives are noticeable for being different, for example row B (halides) and row G (carboxylic acids). As with the MARCO Polo interface, developed to display and share the results of the MARCO classifier [36], our GUI allows images in the time course to be displayed and provides experimental details. Patterns such as those seen in Fig 6 suggest that successful conditions could potentially be related to particular protein properties, allowing optimal conditions for new proteins to be predicted.

Conclusion
We have shown that a bespoke ensemble classifier can be trained to classify images from crystallisation experiments without the need for supercomputers or a huge training data set. Comparison of four popular deep learning architectures revealed interesting differences in class accuracies between these classifiers. While DenseNet121 provided the best overall accuracy, each architecture was optimal for at least one particular class. The results in Table 4 show that DenseNet121 actually only gave the best results for null images and the two precipitate classes, whereas InceptionV3 gave the highest accuracy for clear images, Xception was best for phase separation and optimisable images, and ResNet50 produced the best results for images in the crystalline and shootable classes. Although the overall accuracy achieved with ResNet50 is 9% lower than that for DenseNet121, its accuracy for crystalline images was 10% higher than that obtained with the DenseNet architecture. These differences in misclassification suggested that ensemble classification could improve the results, and the use of majority voting with all four classifiers increased the overall accuracy from 83%, obtained for the best individual classifier, to 95% and Cohen's Kappa from 81% to 94%.

S1 File. Supplementary information file consisting of supplementary figures and tables.
(PDF)